ABSTRACT

This paper aims at answering the following two questions in
privacy-preserving data analysis and publishing: What formal privacy
guarantee (if any) does k-anonymization provide? How can one benefit
from the adversary's uncertainty about the data? We have found that
random sampling provides a connection that helps answer these two
questions, as sampling can create uncertainty. The main result of the
paper is that k-anonymization, when done "safely", and when preceded
by a random sampling step, satisfies (ε, δ)-differential privacy with
reasonable parameters. This result illustrates that "hiding in a crowd
of k" indeed offers some privacy guarantees. This result also suggests
an alternative approach to output perturbation for satisfying
differential privacy: namely, adding a random sampling step in the
beginning and pruning results that are too sensitive to the change of a
single tuple. Regarding the second question, we provide both positive
and negative results. On the positive side, we show that adding a
random-sampling pre-processing step to a differentially private
algorithm can greatly amplify the level of privacy protection. Hence,
when given a dataset resulting from sampling, one can utilize a much
larger privacy budget. On the negative side, any privacy notion that
takes advantage of the adversary's uncertainty likely does not compose.
We discuss what these results imply in practice.

1. INTRODUCTION 

In this paper we deal with the problem of using data in a
privacy-preserving way. We consider the scenario where a trusted
curator obtains a dataset by gathering private information from a large
number of respondents, and then makes use of the dataset while
protecting the privacy of the respondents. The curator may learn and
release to the public statistical facts about the underlying
population. Alternatively, the curator may publish a sanitized (or
"anonymized") version of the dataset so that other parties can use
the data to perform any analysis they are interested in.

This paper aims at answering the following two questions in
privacy-preserving data analysis and publishing. The first is: What
formal privacy guarantee (if any) do k-anonymization methods provide?
k-Anonymization methods have been studied extensively in the database
community, but have been known to lack strong privacy guarantees. The
second question is: How can one benefit from the adversary's
uncertainty about the data? More specifically, can we come up with a
meaningful relaxation of differential privacy [8, 9] by exploiting the
adversary's uncertainty about the dataset? We now discuss these two
motivations in more detail.

The k-anonymity notion was introduced by Sweeney and Samarati
[30, 29, 27, 28] for privacy-preserving microdata publishing. This
notion has been very influential. Many k-anonymization methods have
been developed over the last decades; it has also been extensively
applied to other problems such as location privacy [14]. The
k-anonymity notion requires that when only certain attributes, known
as quasi-identifiers (QIDs), are considered, each tuple in a
k-anonymized dataset should appear at least k times. In this paper, we
consider a version of k-anonymity which treats all attributes as QIDs.
We show that even satisfying this strong version of k-anonymity does
not protect against re-identification attacks. In addition, we
identify the privacy vulnerabilities of existing k-anonymization
algorithms. We then define classes of k-anonymization algorithms that
are "strongly-safe" and "ε-safe", which avoid the privacy
vulnerabilities of existing k-anonymization algorithms. The question
we aim to answer is whether these safe k-anonymization methods provide
strong enough privacy guarantees in practice.

The notion of differential privacy was introduced by Dwork et
al. [8, 11]. An algorithm A satisfies ε-differential privacy (ε-DP)
if and only if for any two neighboring datasets D and D', the
distributions of A(D) and A(D') differ at most by a multiplicative
factor of e^ε. A relaxed version of ε-DP, which we denote by
(ε, δ)-DP, allows an error probability bounded by δ. Satisfying
differential privacy ensures that even if the adversary has full
knowledge of the values of a tuple t, as well as full knowledge of what
other tuples are in the dataset, and is only uncertain about whether t
is in the input dataset, the adversary cannot tell whether t is in the
dataset or not beyond a certain confidence level. In most data
publishing scenarios, however, the adversary is unlikely to have
precise information about all other tuples in a dataset. It is
desirable to exploit this uncertainty to define a relaxed version of
differential privacy, which can be easier to satisfy.

We have found that sampling provides the link between our two
goals. The main result in this paper is that sampling plus "safe"
k-anonymization satisfies (ε, δ)-DP. This result leads us to study the
relationship between sampling and differential privacy. We say that
an algorithm satisfies differential privacy under sampling if the
algorithm preceded by a random sampling step satisfies differential
privacy.

Results about differential privacy under sampling are both of
theoretical interest and of practical relevance. Sampling is a natural
way to model the adversary's uncertainty about the data; thus it helps
us understand how to take advantage of this uncertainty in private
data analysis. On the practical side, many data publishing scenarios
already involve a random sampling step. Sometimes this sampling step
is explicit, when one has a large dataset and wishes to release only a
much smaller sample for research, such as the US Census Bureau's
1-percent Public Use Microdata Sample. Sometimes this sampling step is
implicit; because the respondents are randomly selected, one can view
the dataset as resulting from sampling.

The contributions of this paper are as follows:

• We prove that a safe k-anonymization algorithm, when preceded
by a random sampling step, provides (ε, δ)-differential privacy
with reasonable parameters.

In the literature, k-anonymization and differential privacy have
been viewed as very different privacy guarantees: k-anonymization
is syntactic, and differential privacy is algorithmic and provides
semantic privacy guarantees. Our result is, to our knowledge, the
first to link k-anonymization with differential privacy. It
illustrates that "hiding in a crowd of k" indeed offers privacy
guarantees.

This result also provides a new way of satisfying differential
privacy. Existing techniques for satisfying differential privacy
rely on output perturbation, that is, adding noise to the query
outputs. Our result suggests an alternative approach. Rather than
adding noise to the output, one can add a random sampling step in
the beginning and prune results that are too sensitive to changes
of individual tuples (i.e., tuples that violate k-anonymity).

• We show both positive and negative results on utilizing the
adversary's uncertainty about the data. On the positive side, we
show that random sampling has a privacy amplification effect for
(ε, δ)-DP. For an algorithm that satisfies (ε, δ)-DP, adding a
sampling step with probability β multiplies both e^ε − 1 and δ by
β. For example, applying an algorithm that achieves
(ln 2 ≈ 0.69)-differential privacy to a dataset sampled with
probability 0.1 achieves overall (ln 1.1 ≈ 0.095)-differential
privacy.

On the negative side, we show that any privacy notion that exploits
the adversary's uncertainty about the data is unlikely to compose,
in the sense that publishing the outputs of two algorithms together
may be non-private.

Our results suggest the following approaches to take advantage of
the fact that the input dataset results from explicit or implicit
sampling. If one applies algorithms that satisfy (ε, δ)-DP, then
one can allow a larger privacy budget because of sampling. If one
applies an algorithm that does not satisfy (ε, δ)-DP, but satisfies
(ε, δ)-DP under sampling, then it is safe to apply the algorithm
once. However, if one has a large dataset, one can repeatedly
sample and then apply the algorithm to each newly sampled dataset.

The rest of the paper is organized as follows. We study the
relationship between differential privacy and sampling in Section 2.
We study k-anonymization and prove our main result in Section 3.
We discuss related work in Section 4 and conclude in Section 5. An
appendix includes proofs not found in the main body.

2. DIFFERENTIAL PRIVACY UNDER SAMPLING



2.1 Differential Privacy 

Differential privacy formalizes the following protection objec- 
tive: if a disclosure occurs when an individual participates in the 
database, then the same disclosure also occurs with similar prob- 
ability (within a small multiplicative factor) even when the indi- 
vidual does not participate. More formally, differential privacy re- 
quires that, given two input datasets that differ only in one tuple, the 
output distributions of the algorithm on these two datasets should 
be close. 

Definition 1. [ε-Differential Privacy [8, 11] (ε-DP)]: A randomized
algorithm A gives ε-differential privacy if for any pair of
neighboring datasets D and D', and any O ⊆ Range(A),

\[ \Pr[A(D) \in O] \le e^{\epsilon} \Pr[A(D') \in O] \tag{1} \]

Intuitively, ε-DP offers strong privacy protection. If A satisfies
ε-DP, one can claim that publishing A(D) does not violate the
privacy of any tuple t in D, because even if one leaves t out of
the dataset, in which case the privacy of t can be considered to be
protected, one may still publish the same outputs with a similar
probability.

In practice, ε-DP can be too strong to satisfy in some scenarios.
A commonly used relaxation is to allow a small error probability δ.

Definition 2. [(ε, δ)-Differential Privacy ((ε, δ)-DP)]:

A randomized algorithm A satisfies (ε, δ)-differential privacy if
for any pair of neighboring datasets D and D' and for any
O ⊆ Range(A),

\[ \Pr[A(D) \in O] \le e^{\epsilon} \Pr[A(D') \in O] + \delta \]
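For a mechanism with a small finite output range, the inequality in Definition 2 can be checked directly by enumerating every subset O. The following is a minimal sketch of such a check; the function name and the two example distributions (a single bit reported truthfully with probability 0.75) are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations
from math import exp, log


def satisfies_eps_delta(p, p_prime, eps, delta, tol=1e-12):
    """Brute-force check of Definition 2 for two finite output distributions.

    p and p_prime map each output to its probability under A(D) and A(D').
    The inequality is checked for every non-empty subset O, in both
    directions, with a small tolerance for floating-point rounding.
    """
    outputs = sorted(set(p) | set(p_prime))
    for size in range(1, len(outputs) + 1):
        for subset in combinations(outputs, size):
            mass = sum(p.get(o, 0.0) for o in subset)
            mass_prime = sum(p_prime.get(o, 0.0) for o in subset)
            if mass > exp(eps) * mass_prime + delta + tol:
                return False
            if mass_prime > exp(eps) * mass + delta + tol:
                return False
    return True


# Hypothetical mechanism: report one bit truthfully with probability 0.75.
on_d = {"yes": 0.75, "no": 0.25}
on_d_prime = {"yes": 0.25, "no": 0.75}

print(satisfies_eps_delta(on_d, on_d_prime, log(3), 0.0))
print(satisfies_eps_delta(on_d, on_d_prime, log(2), 0.0))
print(satisfies_eps_delta(on_d, on_d_prime, log(2), 0.25))
```

In this toy case the probability ratio is exactly 3, so the check passes for ε = ln 3 with δ = 0, fails for ε = ln 2 with δ = 0, and passes again for ε = ln 2 once δ = 0.25 absorbs the excess mass.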

Existing methods to satisfy differential privacy include adding
Laplace noise proportional to the query's global sensitivity [8, 11],
adding noise related to the smooth bound of the query's local
sensitivity [26], and the exponential mechanism to select a result
among all possible results [25].
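As an illustration of the first method, the sketch below adds Laplace noise to a counting query. It assumes that neighboring datasets differ by adding or removing one tuple, so a count has global sensitivity 1; the helper names and the use of Python's standard `random` module are choices made for this example only.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(dataset, predicate, epsilon, rng):
    """Counting query released with Laplace noise of scale 1/epsilon.

    Assumes global sensitivity 1 (one tuple changes the count by at most 1).
    """
    true_count = sum(1 for t in dataset if predicate(t))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)
data = ["a", "b", "a", "c", "a"]
print(noisy_count(data, lambda t: t == "a", epsilon=1.0, rng=rng))
```

A smaller ε gives a larger noise scale and therefore a noisier answer.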

2.2 Uncertain Background Knowledge 

One of our goals is to develop a further relaxation of differential
privacy that can be more easily satisfied. The intuition that we
want to exploit is the adversary's uncertainty about the underlying
dataset. The (ε, δ)-DP notion ensures that when an adversary is
uncertain about whether one tuple t is present in the input dataset,
even when the adversary knows precise information about all other
tuples in the input dataset, the adversary cannot tell based on the
output whether t is in the input or not. We believe that it is
reasonable to relax this assumption so that the adversary knows all
attributes of a tuple t (but not whether t is in the dataset), and in
addition only statistical information about the rest of the dataset D.
The privacy notion should prevent such an adversary from substantially
distinguishing between D and D ∪ {t} based on the output.

The desire to exploit the adversary's uncertainty is shared by other
researchers. For example, Adam Smith's blog post summarizing
the Workshop on Statistical and Learning-Theoretic Challenges in
Data Privacy includes a section on relaxed definitions of privacy
with meaningful semantics: "it would be nice to see meaningful
definitions of privacy in statistical databases that exploit the
adversary's uncertainty about the data. The normal approach to this is
to specify a set of allowable prior distributions on the data (from the
adversary's point of view). However, one has to be careful. The
versions I have seen are quite brittle."¹

¹ http://adamdsmith.wordpress.com/2010/03/04/ipam-workshop-wrap-up/






Some degree of brittleness may be unavoidable. It appears that
any privacy notion that takes advantage of the adversary's
uncertainty about the data is not robust under composition.
Composition requires that, given two algorithms that both satisfy the
privacy notion, their combination (applying both algorithms to the
same input dataset and then publishing both outputs) also satisfies
the privacy notion.

Consider the following two algorithms. Let r(D) be the predicate
that D contains an odd number of tuples, and s(D) be a sensitive
predicate, e.g., whether a tuple t is in D. Algorithm A1(D)
outputs r(D), and A2(D) outputs r(D) XOR s(D). Both A1 and
A2 should satisfy a privacy notion that assumes that the adversary
is uncertain about the data, because there is no reason that the
adversary should know the exact number of tuples. However, the
composition of A1 and A2 leaks s(D), since XORing the two outputs
cancels r(D). More generally, for any privacy definition that
exploits the adversary's uncertainty about the data, there exists at
least one predicate that the adversary is uncertain about. Then one
algorithm can output that predicate, and a second algorithm can output
that predicate XORed with a predicate that results in privacy leakage;
the two do not compose.
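The cancellation can be made concrete in a few lines. This is a minimal sketch of the two algorithms described above, with datasets modeled as Python lists and the target tuple passed explicitly; these representation choices are assumptions of the example.

```python
def r(dataset):
    # Predicate the adversary is assumed to be uncertain about:
    # does the dataset contain an odd number of tuples?
    return len(dataset) % 2 == 1


def s(dataset, target):
    # Sensitive predicate: is the target tuple present?
    return target in dataset


def a1(dataset):
    return r(dataset)


def a2(dataset, target):
    # Boolean inequality is XOR.
    return r(dataset) != s(dataset, target)


def recover_s(out1, out2):
    # r XOR (r XOR s) = s, so the two published bits reveal s exactly.
    return out1 != out2


with_target = ["alice", "bob", "carol"]
without_target = ["alice", "bob", "dave"]
for d in (with_target, without_target):
    print(recover_s(a1(d), a2(d, "carol")) == s(d, "carol"))
```

Each output on its own is a single bit masked by the parity of the dataset size, yet the pair determines s(D) regardless of what r(D) is.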

The above observation suggests that no such definition should 
be used in the interactive setting of answering multiple queries. If, 
however, one intends to publish a dataset in the non-interactive set- 
ting only once, then the inability to compose may be an acceptable 
limitation. 

2.3 Differential Privacy under Sampling 

One natural approach to capturing the adversary's uncertainty
about the input data is to add a sampling step. We introduce the
following definition, called (β, ε, δ)-Differential Privacy under
Sampling ((β, ε, δ)-DPS for short).

Definition 3 (Differential privacy under sampling).
An algorithm A gives (β, ε, δ)-DPS if and only if β > δ and the
algorithm A^β gives (ε, δ)-DP, where A^β denotes the algorithm that
first samples with probability β (i.e., includes each tuple in the
input dataset with probability β), and then applies A to the sampled
dataset.
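The construction of A^β from A can be sketched as a small wrapper. The function name and the explicit random-number generator argument are assumptions made for this illustration; any algorithm that accepts a list of tuples can be passed in.

```python
import random


def sample_then_apply(algorithm, dataset, beta, rng):
    """Sketch of A^beta: keep each tuple with probability beta, then run A."""
    sampled = [t for t in dataset if rng.random() < beta]
    return algorithm(sampled)


rng = random.Random(1)
dataset = list(range(10000))
# Using len as a stand-in algorithm shows the size of the sampled dataset.
print(sample_then_apply(len, dataset, 0.5, rng))
```

Note that the wrapper only builds A^β; whether A^β actually satisfies (ε, δ)-DP depends on A and must be established separately.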

The above definition requires β > δ because any algorithm trivially
satisfies (β, 0, δ)-DPS when β ≤ δ. This is because when two
datasets differ only by one tuple, sampling from them with
probability β will result in exactly the same output with probability
1 − β, so the two output distributions differ by at most β ≤ δ.
However, when β > δ, the notion of (β, ε, δ)-DPS is both nontrivial
to satisfy and a nontrivial relaxation of (ε, δ)-DP, as shown by our
results in Section 3. There we show that existing k-anonymization
algorithms do not satisfy (β, ε, δ)-DPS and have privacy
vulnerabilities, and that safe (and possibly deterministic)
k-anonymization satisfies (β, ε, δ)-DPS, while violating (ε, δ)-DP
for any δ < 1.

2.4 The Amplification Effect of Sampling 

An interesting feature of the (β, ε, δ)-DPS notion is that there is
a connection between the privacy parameters ε, δ and the sampling
rate β. The following theorem shows that by employing a smaller
sampling rate, one can achieve stronger privacy protection (i.e.,
smaller values for ε and δ).

Theorem 1.² Any algorithm that satisfies (β1, ε1, δ1)-DPS
also satisfies (β2, ε2, δ2)-DPS for any β2 < β1, where

\[ \epsilon_2 = \ln\left(1 + \frac{\beta_2}{\beta_1}\left(e^{\epsilon_1} - 1\right)\right), \qquad \delta_2 = \frac{\beta_2}{\beta_1}\,\delta_1. \]

Table 1: Effect of privacy parameters under sampling.

An equivalent way to write the expression for ε2 is

\[ \frac{e^{\epsilon_2} - 1}{e^{\epsilon_1} - 1} = \frac{\beta_2}{\beta_1}. \]

In other words, by decreasing the sampling probability, one obtains
proportional decreases in e^ε − 1 and δ, improving the privacy
protection. Hence, when one possesses a randomly sampled dataset,
one can use a much more relaxed privacy budget ε and error tolerance
δ. To see the effects of this, in Table 1 we show the privacy
parameters for an algorithm that satisfies (ln 11, 10~^)-DP, and an
algorithm that satisfies (1, 0)-DP, under sampling rates 0.1 and 0.01.
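The formula in Theorem 1 is easy to evaluate numerically. The sketch below is a direct transcription; the function name is an assumption, and treating an (ε, δ)-DP algorithm as the case β1 = 1 (sampling that keeps every tuple) is the reading used in the examples.

```python
from math import exp, log


def amplify(eps1, delta1, beta1, beta2):
    """Parameters (eps2, delta2) implied by Theorem 1 when the sampling
    rate is lowered from beta1 to beta2."""
    if not 0 < beta2 <= beta1 <= 1:
        raise ValueError("need 0 < beta2 <= beta1 <= 1")
    ratio = beta2 / beta1
    eps2 = log(1 + ratio * (exp(eps1) - 1))
    delta2 = ratio * delta1
    return eps2, delta2


# (ln 2)-DP followed by sampling at rate 0.1 gives about ln 1.1.
print(amplify(log(2), 0.0, 1.0, 0.1))
# eps = 1 at rates 0.1 and 0.01.
print(amplify(1.0, 0.0, 1.0, 0.1))
print(amplify(1.0, 0.0, 1.0, 0.01))
# eps = ln 11 at rate 0.1: e^eps - 1 drops from 10 to 1, i.e. eps2 = ln 2.
print(amplify(log(11), 0.0, 1.0, 0.1))
```

For small ε1 the new ε2 is roughly (β2/β1)·ε1, while for large ε1 the reduction in ε is much less than proportional, because it is e^ε − 1 rather than ε that scales linearly.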

Smith's blog³ includes an "amplification" lemma for differential
privacy, which was used implicitly in the design of a PAC learner
for the parity class in [17]. The lemma states that an algorithm that
satisfies (ε = 1)-DP, when preceded by random sampling with rate
β, satisfies (2β)-DP. Theorem 1 exploits similar observations, but
is more general in that it applies to (ε, δ)-DP rather than only
ε-DP, and to arbitrary values of ε. Our result is also slightly
tighter; for example, for the special case of ε = 1 and β = 0.1, we
give a result of 0.159 as opposed to 2β = 0.2.

2.5 Properties of (β, ε, δ)-DPS

While the (β, ε, δ)-DPS notion does not compose, it does have
several other desirable properties. In [19], Kifer and Lin identified
two privacy axioms when they defined generic differential privacy.
The Transformation Invariance axiom states that given an algorithm A
that satisfies a privacy notion, adding any post-processing step
operating on A's output should still satisfy the privacy notion. The
Privacy Axiom of Choice states that given two algorithms A1 and A2
that both satisfy a privacy notion, a new algorithm that chooses A1
with probability p and A2 with probability 1 − p should also satisfy
the notion. We now show that (β, ε, δ)-DPS satisfies both axioms.

Theorem 2. Given A1 that satisfies (β, ε, δ)-DPS and any
algorithm A2, A(D) = A2(A1(D)) satisfies (β, ε, δ)-DPS.

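Proof. Assume, for the sake of contradiction, that A does not
satisfy (β, ε, δ)-DPS. Then there exist neighboring datasets D and D'
and O ⊆ Range(A2) such that

\[ \Pr[A_2(A_1^{\beta}(D)) \in O] > e^{\epsilon} \Pr[A_2(A_1^{\beta}(D')) \in O] + \delta. \]

Consider all S in Range(A1), and let q(S) = Pr[A2(S) ∈ O],
p(S) = Pr[A1^β(D) = S], and p'(S) = Pr[A1^β(D') = S].
Then we have

\[ \sum_{S \in Range(A_1)} p(S)\,q(S) > e^{\epsilon} \sum_{S \in Range(A_1)} p'(S)\,q(S) + \delta. \]

We partition Range(A1) into

\[ \mathcal{S}_1 = \{ S \mid p(S) > e^{\epsilon} p'(S) \} \quad \text{and} \quad \mathcal{S}_2 = \{ S \mid p(S) \le e^{\epsilon} p'(S) \}. \]

Rewriting the above inequality, we have

\[ \sum_{S \in \mathcal{S}_1} p(S)\,q(S) + \sum_{S \in \mathcal{S}_2} p(S)\,q(S) > e^{\epsilon} \sum_{S \in \mathcal{S}_1} p'(S)\,q(S) + e^{\epsilon} \sum_{S \in \mathcal{S}_2} p'(S)\,q(S) + \delta. \]

Considering the sum over 𝒮2, the definition of 𝒮2 gives

\[ \sum_{S \in \mathcal{S}_2} p(S)\,q(S) \le e^{\epsilon} \sum_{S \in \mathcal{S}_2} p'(S)\,q(S). \]

Subtracting the above from the previous inequality, we have

\[ \sum_{S \in \mathcal{S}_1} p(S)\,q(S) > e^{\epsilon} \sum_{S \in \mathcal{S}_1} p'(S)\,q(S) + \delta. \]

For each S ∈ 𝒮1, we have p(S)(1 − q(S)) ≥ e^ε p'(S)(1 − q(S)),
and thus

\[ \sum_{S \in \mathcal{S}_1} p(S)\,(1 - q(S)) \ge e^{\epsilon} \sum_{S \in \mathcal{S}_1} p'(S)\,(1 - q(S)). \]

Summing up the above two inequalities, we have

\[ \sum_{S \in \mathcal{S}_1} p(S) > e^{\epsilon} \sum_{S \in \mathcal{S}_1} p'(S) + \delta, \]

that is, Pr[A1^β(D) ∈ 𝒮1] > e^ε Pr[A1^β(D') ∈ 𝒮1] + δ.
This contradicts that A1 satisfies (β, ε, δ)-DPS. □

² See Appendix A.1 for the proof of Theorem 1.

³ http://adamdsmith.wordpress.com/2009/09/02/sample-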

Theorem 3. Given two algorithms A1 and A2 that both satisfy
(β, ε, δ)-DPS, for any p ∈ [0, 1], let A_p(D) be the algorithm
that outputs A1(D) with probability p and A2(D) with probability
1 − p. Then A_p satisfies (β, ε, δ)-DPS.

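Proof. Since both A1 and A2 satisfy (β, ε, δ)-DPS, for
any pair of neighboring datasets D and D' and for any
O ⊆ Range(A1) ∪ Range(A2), we have

\begin{align*}
\Pr[A_p^{\beta}(D) \in O]
&= p \Pr[A_1^{\beta}(D) \in O] + (1 - p) \Pr[A_2^{\beta}(D) \in O] \\
&\le p \left( e^{\epsilon} \Pr[A_1^{\beta}(D') \in O] + \delta \right) + (1 - p) \left( e^{\epsilon} \Pr[A_2^{\beta}(D') \in O] + \delta \right) \\
&= e^{\epsilon} \left( p \Pr[A_1^{\beta}(D') \in O] + (1 - p) \Pr[A_2^{\beta}(D') \in O] \right) + \delta \\
&= e^{\epsilon} \Pr[A_p^{\beta}(D') \in O] + \delta.
\end{align*}

Therefore, the algorithm A_p also satisfies (β, ε, δ)-DPS. □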

2.6 More Non-Composability 



From the observations in Section 2.2 we expect that (β, ε, δ)-DPS
does not compose. However, one would expect that combining an
algorithm that satisfies (β, ε1, δ)-DPS and one that satisfies
ε2-DP should result in an algorithm that satisfies the weaker
(β, ε, δ)-DPS, where ε is some function of ε1 and ε2. Such a
weaker form of composability would be useful in that, given a dataset
that results from random sampling, one could publish it in a way that
satisfies (β, ε, δ)-DPS, while at the same time answering queries
using mechanisms that satisfy ε-DP. Surprisingly, even such a weak
form of composability does not hold.

Consider the following two algorithms operating on datasets in
which each tuple has two fields: gender and name. Let r(D) be
the predicate that D contains more male tuples than female tuples,
and s(D) be a sensitive predicate, such as whether D contains a
specific tuple. The algorithm A1(D) outputs r(D) XOR s(D) when D
contains a sufficient number of tuples (say, 1000), and outputs false
otherwise. A2(D) outputs the percentage of tuples in D that are male,
with Laplacian noise [11].

Clearly A2 satisfies ε-DP. A1 satisfies (β, ε, δ)-DPS for
any β that is not too close to 1. Let T and T' be the random
variables resulting from sampling from D and D' respectively. Only
when the dataset is large enough does A1 output information that
depends on the input data. When D and D' contain a large number of
tuples and differ only by one, r(T) and r(T') have essentially the
same distribution, taking the value true with probability very close
to 0.5, so that A1(T) and A1(T') have similar distributions.
Combining A1 and A2, however, is non-private. Using A2(T) one
obtains a highly accurate estimate of the predicate r(T), enabling
the adversary to learn s(T) with high probability.

More specifically, let D and D' be two datasets such that s(D)
is false, s(D') is true (i.e., D' contains the tuple we are checking),
and they each contain 10,000 tuples, half male and half female.
Consider sampling probability β = 0.5, and the event that A1
outputs false and A2 outputs p > 0.5. Let T and T' be the random
variables resulting from sampling from D and D' respectively. Then
we have

\begin{align*}
\Pr[s(T) = \text{true}] &= 0, \\
\Pr[s(T') = \text{true}] &= 1/2, \\
\Pr[r(T) = \text{true} \mid A_2(T) > 0.5] &\approx 1, \\
\Pr[r(T') = \text{true} \mid A_2(T') > 0.5] &\approx 1,
\end{align*}

and

\begin{align*}
& \Pr[A_2(T) > 0.5 \wedge A_1(T) = \text{false}] \\
&= \Pr[A_2(T) > 0.5] \, \Pr[r(T) = s(T) \mid A_2(T) > 0.5] \\
&\approx \Pr[A_2(T) > 0.5] \, \Pr[s(T) = \text{true}] \\
&= 0,
\end{align*}

while

\begin{align*}
& \Pr[A_2(T') > 0.5 \wedge A_1(T') = \text{false}] \\
&\approx \Pr[A_2(T') > 0.5] \, \Pr[s(T') = \text{true} \mid A_2(T') > 0.5] \\
&= \Pr[A_2(T') > 0.5] \, \Pr[s(T') = \text{true}] \\
&\approx 1/4.
\end{align*}
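The two probabilities above can be estimated by simulation. The following sketch makes several assumptions that are not in the paper: the target tuple is counted among the females, A2 adds Laplace noise with scale 1/(ε·|T|) to the male fraction, ε = 1, and the β = 0.5 sampling of each group is drawn as a count of fair coin flips rather than tuple by tuple.

```python
import random


def laplace(scale, rng):
    # Difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def fair_binomial(n, rng):
    # Number of tuples kept out of n when each is kept with probability 0.5.
    return bin(rng.getrandbits(n)).count("1") if n > 0 else 0


def joint_event_rate(males, females, has_target, epsilon, trials, rng):
    """Estimate Pr[A2(T) > 0.5 and A1(T) = false] for beta = 0.5."""
    hits = 0
    for _ in range(trials):
        m = fair_binomial(males, rng)
        f = fair_binomial(females - (1 if has_target else 0), rng)
        s = has_target and rng.random() < 0.5  # did the target get sampled?
        if s:
            f += 1
        size = m + f
        if size == 0:
            continue
        r = m > f
        a1 = (r != s) if size >= 1000 else False
        a2 = m / size + laplace(1.0 / (epsilon * size), rng)
        if a2 > 0.5 and not a1:
            hits += 1
    return hits / trials


rng = random.Random(0)
rate_d = joint_event_rate(5000, 5000, False, 1.0, 4000, rng)
rate_d_prime = joint_event_rate(5000, 5000, True, 1.0, 4000, rng)
print(rate_d, rate_d_prime)
```

Under these assumptions the estimate for D should come out close to 0 and the one for D' close to 1/4, a gap that no moderate ε with small δ can account for.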

This result is somewhat surprising. After all, any mechanism
that satisfies ε-DP should not be leaking private information about
the underlying datasets. How could adding a differentially private
mechanism destroy the privacy protection of another mechanism?
Our understanding is that (β, ε, δ)-DPS can be satisfied by relying
on the adversary's uncertainty. The adversary knows only that the
dataset is from a large set of candidates. While ε-DP ensures that
adjacent datasets are difficult to distinguish, these candidates are
not all adjacent and can indeed be quite far apart. Hence obtaining
one ε-DP answer may dramatically change the probability of which
candidates are possible, removing some degree of uncertainty and
destroying any privacy protection that relies on exactly that
uncertainty.

This inability of a (β, ε, δ)-DPS mechanism to compose with an
ε-DP mechanism suggests that (β, ε, δ)-DPS mechanisms should
be applied alone. Hence they are not suitable for the interactive
mode, but only for the non-interactive mode of data publishing.
Furthermore, it also suggests that mechanisms satisfying ε-DP should
be used carefully as well, as their outputs may break other
mechanisms' (albeit weaker) privacy guarantees.

2.7 Benefiting from Sampling 

We observe that in many data publishing scenarios, random sampling
is an inherent step. For example, the census bureau publishes
a 1-percent microdata sample. In many research settings (such as
when Netflix wants to publish movie ratings), it is sufficient to
publish a random sample of the dataset. Many times, even when
the dataset is not the result of explicit sampling, one can view it
as the result of implicit sampling, because the process of selecting
respondents involves randomness.

The natural question is how one can benefit from such explicit or
implicit sampling. Our results provide the following answers. The
first way is to limit oneself to mechanisms that satisfy (ε, δ)-DP;
the uncertainty resulting from sampling then enables one to use a
larger privacy budget because of the amplification result in
Theorem 1. The second way is to use a mechanism that does not satisfy
(ε, δ)-DP, but satisfies (β, ε, δ)-DPS, such as safe
k-anonymization, which we will study in Section 3. However, this
way of benefiting from sampling can be enjoyed only once; one
cannot use the same dataset to answer other queries, even when
using mechanisms that satisfy ε-DP.

There does, however, exist a more flexible way to use a mechanism
that satisfies only (β, ε, δ)-DPS. When one has a large
dataset, one can sample a dataset, apply the mechanism, publish
the result, and discard the intermediate sampled dataset. Because
of the composability of (ε, δ)-DP, this approach can be applied
multiple times so long as each time one performs a fresh sampling.
One can also use multiple mechanisms that satisfy (ε, δ)-DP on a
newly sampled dataset.

We point out that the benefit of sampling should not be viewed
as just "throwing away data"; sampling's main benefit is to
introduce uncertainty. Given a dataset, one could sample with, say,
β = 0.2 many (say, 50) times, apply a mechanism that satisfies
(0.2, 0.02, 0)-DPS to each sampled dataset, and publish the results.
With high probability, each tuple is included in at least one of the
sampled datasets. That is, in some sense, no tuple is thrown away.
However, as each sampling and publishing step satisfies (ε, δ)-DP,
and (ε, δ)-DP composes, publishing the 50 outputs still satisfies
(ε, δ)-DP for ε = 50 × 0.02 = 1, δ = 0.
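The arithmetic behind this example can be checked with a short sketch. It assumes basic sequential composition for (ε, δ)-DP, under which the ε and δ values of the individual releases add up; the function names are illustrative.

```python
def composed_budget(eps_per_release, delta_per_release, releases):
    # Basic sequential composition: epsilons and deltas add up.
    return eps_per_release * releases, delta_per_release * releases


def inclusion_probability(beta, rounds):
    # Probability that a given tuple appears in at least one of `rounds`
    # independent samples, each keeping it with probability beta.
    return 1.0 - (1.0 - beta) ** rounds


print(composed_budget(0.02, 0.0, 50))
print(inclusion_probability(0.2, 50))
```

With β = 0.2 and 50 rounds, a tuple is missed by every sample with probability 0.8^50, which is on the order of 10^-5.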

In summary, sampling creates uncertainty for the adversary. 
While the benefit due to this uncertainty is easy to lose because 
the uncertainty can be jeopardized by answering any query on it, 
this uncertainty is also easy to gain, as each sampling introduces 
fresh uncertainty. 

3. SAFE k-ANONYMIZATION MEETS DIFFERENTIAL PRIVACY

In this section we show that k-anonymization, when performed in a
"safe" way, satisfies (β, ε, δ)-DPS. That is, safe k-anonymization,
when preceded by a random sampling step, satisfies
(ε, δ)-differential privacy.

3.1 An Analysis of k-Anonymity

The development of k-anonymity was motivated by a well-publicized
privacy incident [30]. The Group Insurance Commission (GIC) published
a supposedly anonymized dataset recording the medical visits of
patients managed under its insurance plan. While the obvious personal
identifiers (such as name and address) were removed, the published
data included zip code, date of birth, and gender, which are
sufficient to uniquely identify a significant fraction of the
population. Sweeney [30] showed that by correlating this data with
the publicly available Voter Registration List for Cambridge,
Massachusetts, medical visits for many individuals can be easily
identified, including those of William Weld, a former governor of
Massachusetts. We note that even without access to the public voter
registration list, the same privacy breaches can occur. Many
individuals' birthdate, gender and zip code are public information.
This is especially the case with the advent of social media,
including Facebook, where users share seemingly innocuous personal
information with the public. The GIC re-identification attack
directly motivated the development of the k-anonymity privacy notion.

Definition 4. [k-Anonymity, the privacy notion] [30]: A published
table satisfies k-anonymity relative to a set of QID attributes if
and only if when the table is projected to include only the QIDs,
every tuple appears at least k times.

Quasi-identifiers vs. Sensitive Attributes? A first problem
with Definition 4 is that it requires the division of all attributes
into quasi-identifiers (QIDs) and sensitive attributes (SAs), where
the adversary is assumed to know the QIDs, but not the SAs. This
separation, however, is very hard to obtain in practice. Even though
only some attributes were used in the GIC incident, it is difficult to
assume that they are the only QIDs. Other attributes in the GIC
data include visit date, diagnosis, etc. There may well exist an
adversary who knows this information about some individuals, and if
with this knowledge these individuals' records can be re-identified,
it is still a serious privacy breach.

The same difficulty arises when publishing any kind of census,
medical, or transactional data. When publishing anonymized
microdata, one has to defend against all kinds of adversaries: some
know one set of attributes, and others know different sets. An
attribute about one individual may be known to some adversaries,
and unknown to (and thus sensitive with respect to) other
adversaries.

Any separation between QIDs and SAs essentially makes assumptions
about the adversary's background knowledge that can be easily
violated, rendering any privacy protection invalid. Hence we consider
a strengthened version of k-anonymity that treats all attributes as
QIDs. This is stronger than using any subset of attributes as QIDs.
This strengthened version of k-anonymity avoids making assumptions
about which attributes the adversary knows and which it does not.
This has been used in the context of anonymizing transaction data.

Weakness of the k-Anonymity Notion. With the strengthened
version of k-anonymity, one might expect that it should stop
re-identification attacks. To satisfy this notion, each tuple in the
output is blended into a group of at least k tuples that are the same.
This follows the appealing principle that "privacy means hiding in
a crowd". The intuition is that as there are at least k − 1 other
tuples that look exactly the same, one cannot re-identify which tuple
in the output corresponds to an individual with probability over 1/k.
Unfortunately, this intuition turns out to be wrong. Only making the
syntactic requirement that each tuple appears at least k times does
not protect privacy, as a trivial way to satisfy this is to select
some tuples from the input and then duplicate each of them k times.
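The trivial construction can be written out in a few lines. This is a deliberately unsafe sketch with hypothetical function names and made-up example tuples; it shows an output that passes the syntactic k-anonymity check (with all attributes treated as QIDs) while publishing some input tuples verbatim.

```python
from collections import Counter


def is_k_anonymous(table, k):
    # All attributes are treated as QIDs: every distinct tuple in the
    # published table must appear at least k times.
    return all(count >= k for count in Counter(table).values())


def duplicate_publish(table, k, keep):
    """Unsafe on purpose: publish the first `keep` tuples, k copies each."""
    return [t for t in table[:keep] for _ in range(k)]


table = [("47906", 1975, "F"), ("47907", 1982, "M"), ("47901", 1990, "F")]
published = duplicate_publish(table, k=3, keep=2)
print(is_k_anonymous(published, 3))
print(set(published) <= set(table))
```

Both checks succeed: the output is 3-anonymous, and every published tuple is an exact copy of an input tuple, so the individuals behind those tuples are fully exposed.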

Several other privacy notions have been introduced on the
motivation that k-anonymity is not strong enough. Among these are
ℓ-diversity [23] and t-closeness [22]. In these approaches, it is
observed that even if k-anonymity is achieved, information about
sensitive attributes can still be learned, perhaps due to the uneven
distribution of their values. This line of work, however, still
requires the problematic assumption that there is a separation between
QIDs and SAs, and that the adversary knows only the QIDs. In other
words, while they correctly assert that k-anonymity is not strong
enough, these definitions did not fix it in the right way.

k-Anonymity vs. k-Anonymization Algorithms. Here we would
like to make a clear distinction between k-anonymity, the privacy
notion, and k-anonymization algorithms.

Many k-anonymization algorithms have been developed in the
literature. Given input datasets, they aim at producing anonymized
versions of the input datasets that satisfy k-anonymity. That the
k-anonymity privacy notion is weak means that producing outputs
that satisfy k-anonymity is, by itself, insufficient for privacy
protection. However, this does not automatically mean that all
k-anonymization algorithms have privacy vulnerabilities. We now show
that the algorithms that have been developed in the literature are
indeed vulnerable to re-identification attacks. Consider the
following anonymization scheme, which represents several proposed
algorithms for k-anonymity [5, 21].

Algorithm 1. [Clustering and Local Recoding (CLR)]: 

First, group input tuples into clusters such that each cluster has
at least $k$ tuples. For example, one method of grouping is the
Mondrian algorithm [21]. One could also use some clustering method
based on some distance measurement. Then, for each
tuple, replace each attribute value with a generalized value that
represents all values for that attribute in the cluster.

CLR algorithms are vulnerable when some tuples contain
extreme values. Even if the output satisfies k-anonymity, the
generalized value depends on the extreme values of some tuples; hence
from the output an adversary can infer that one's tuple is in the
dataset and can thus infer these values. For example, suppose
the dataset records the net worth of some individuals in a town.
Further suppose that it is known that only one individual in the
town has net worth over $10 million. When given a ($k = 20$)-anonymized
output dataset containing one group of tuples that all
have [900K, 35M] as the generalized net worth value, what can
one conclude? At least the following: the rich individual is in the
dataset; the individual's tuple is in the group; and the individual's
net worth is $35 million. It would be difficult to claim that, because
at least 19 other tuples in the output dataset are exactly
the same, the individual cannot be re-identified with probability
over 1/20.
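
The leak in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `clr_generalize`, the use of a [min, max] range as the generalized value, and all net-worth figures are assumptions made for the demonstration, not part of any published CLR algorithm.

```python
def clr_generalize(cluster):
    """Replace every value in a cluster by the cluster's (min, max) range.

    Illustrative local recoding: the generalized value depends directly on
    the extreme values present in the cluster.
    """
    lo, hi = min(cluster), max(cluster)
    return [(lo, hi)] * len(cluster)


# Hypothetical net-worth values (in dollars) for one cluster of 20 tuples.
others = [900_000 + 50_000 * i for i in range(19)]   # 19 ordinary residents
rich = 35_000_000                                     # the single outlier

with_rich = clr_generalize(others + [rich])
without_rich = clr_generalize(others + [1_000_000])   # outlier replaced

print(with_rich[0])     # (900000, 35000000)
print(without_rich[0])  # (900000, 1800000)
```

Both outputs contain 20 identical tuples and so satisfy 20-anonymity, yet the upper end of the range reveals whether the outlier is present and what its value is.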

Similar weaknesses exist for other k-anonymization algorithms
in the literature, for example, those computing a generalization
scheme based on the input dataset [16]. With all these algorithms,
the presence or absence of some extreme values will affect
the resulting generalization scheme, leaking information.

As these algorithms are sensitive to the presence of a single tuple
with extreme values, they do not satisfy $(\beta, \epsilon, \delta)$-DPS when $\beta > \delta$,
since sampling with probability $\beta$ results in that tuple being selected
with probability $\beta$.

3.2 Towards "Safe" k-Anonymization

We have shown that k-anonymity (even when all attributes are
treated as QID) does not provide adequate protection, nor do
existing k-anonymization algorithms. One natural question is: Is this
because the intuition "hidden in a crowd" fails to provide privacy
protection, or is it because the definition of k-anonymity fails to
correctly capture "hidden in a crowd"?

We believe that the answer is the latter. The notion of
k-anonymity implicitly assumes that there is a one-to-one relation $g$
between the input tuples and the output tuples, i.e., given input $D$,
the output dataset is $\{g(t) \mid t \in D\}$. When there are $k$ output
tuples that are the same, there must exist $k$ input tuples that are
indistinguishable based only on their corresponding outputs. However,
this relation $g$ itself can be overly dependent on one or a few input
tuples. For instance, consider the example above with the extreme
value. Choosing [900K, 35M] as the generalized value depends on
the single input tuple with value 35M; hence all tuples that contain
this generalized value are directly affected by one tuple's presence,
and the tuple is not really "hiding in a crowd".

An intriguing question is: If a k-anonymization algorithm uses a
mapping that does not overly depend on any individual tuple, does
such an algorithm provide an adequate level of privacy protection?
To answer this question, we first formalize such algorithms as safe
k-anonymization algorithms.

Intuitively, a k-anonymization algorithm $A$ takes as input a
dataset $D$ and a value $k$ and produces an output dataset $S = A(D)$.



In order to define "safe" anonymization algorithms, we require
each anonymization algorithm $A$ to be specified in two steps. The
first step, $A_m$, outputs a mapping function $g : D \to T$, where $T$ is
the set of all possible tuples. The second step applies $g$ to all tuples
in $D$. That is, $A(D, k) = \mathrm{Apply}(A_m(D, k), D, k)$, where Apply
is defined as follows.



Apply(g, D, k)
    S ← ∅
    for all t ∈ D do
        S ← S ∪ {g(t)}
    end for
    for all s ∈ S do
        if s appears less than k times in S then
            remove all occurrences of s from S
        end if
    end for
    return S
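
A minimal Python sketch of the Apply step may help make the suppression rule concrete. The names `apply_mapping` and `decade`, the choice of a list as the multiset representation, and the sample ages are all illustrative assumptions rather than anything prescribed by the paper.

```python
from collections import Counter


def apply_mapping(g, D, k):
    """Apply mapping g to every tuple in D, then suppress rare outputs.

    S is treated as a multiset (a Python list); any generalized tuple that
    occurs fewer than k times is removed entirely.
    """
    S = [g(t) for t in D]
    counts = Counter(S)
    return [s for s in S if counts[s] >= k]


def decade(t):
    """A fixed, data-independent recoding (illustrative): age -> decade."""
    (age,) = t
    return (age // 10 * 10,)


D = [(23,), (27,), (29,), (31,), (45,), (47,)]
print(apply_mapping(decade, D, k=2))
# [(20,), (20,), (20,), (40,), (40,)]
```

The single tuple mapped to `(30,)` is dropped because it appears fewer than `k = 2` times after recoding.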



We note that all existing k-anonymization algorithms can be
modeled this way, as there is no limitation on the form of $A_m$'s
output $g$. In the extreme case, $g$ can be described as a table
matching each tuple in $D$ to the desired output tuple.

Definition (Strongly-Safe Anonymization). We
say that a k-anonymization algorithm $A$ is strongly safe if and only
if the function $A_m(D, k)$ remains constant when $D$ changes,
i.e., the mapping $g$ does not depend on its input dataset.

An example of a strongly-safe k-anonymization algorithm is to
always use the same global recoding scheme no matter what dataset
is the input.

Intuitively a strongly-safe k-anonymization algorithm provides
some level of privacy protection, and the level of privacy
protection increases with larger values of $k$. If any individual's tuple is
published, there must exist at least $k - 1$ other tuples in the input
database that are the same under the recoding scheme; furthermore,
the recoding scheme does not depend on the dataset, and one sees
only the results of the recoding. Hence in this input dataset, the
individual is hidden in a crowd of at least $k$. However, the following
proposition shows that strongly-safe k-anonymization algorithms
do not satisfy $(\epsilon, \delta)$-DP.

Proposition 4. No strongly-safe k-anonymization algorithm
satisfies $(\epsilon, \delta)$-DP for any $\delta < 1$.

Proof. Given a strongly-safe algorithm $A$, let $g$ be the
mapping $A$ uses. Choose $D$ and $D'$ that differ in one tuple $t$, such that $D$
contains $n > k$ tuples $t'$ with $g(t') = g(t)$. The dataset $D'$
contains $n - 1$ such tuples. Then, $A(D)$ and $A(D')$ contain
different numbers of copies of $g(t)$. Let $S = A(D)$; we have $\Pr[A(D) = S] = 1$
and $\Pr[A(D') = S] = 0$. □

3.3 Privacy of Strongly-Safe k-Anonymization

We now show that any strongly-safe k-anonymization algorithm, when
preceded by random sampling with probability $\beta$, satisfies
$(\epsilon, \delta)$-differential privacy for a small $\delta$ with reasonable
values of $k$ and $\beta$. We use $f(j; n, \beta)$ to denote the probability mass
function for the binomial distribution; that is, $f(j; n, \beta)$ gives the
probability of getting exactly $j$ successes in $n$ trials where each
trial succeeds with probability $\beta$. And we use $F(j; n, \beta)$ to
denote the cumulative probability mass function; that is,

$$F(j; n, \beta) = \sum_{i=0}^{j} f(i; n, \beta).$$

Theorem 5. Any strongly-safe k-anonymization algorithm
satisfies $(\beta, \epsilon, \delta)$-DPS for any $0 < \beta < 1$, $\epsilon \ge -\ln(1 - \beta)$,
and $\delta = d(k, \beta, \epsilon)$, where the function $d$ is defined as

$$d(k, \beta, \epsilon) = \max_{n : n \ge \lceil k/\gamma - 1 \rceil} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \text{where } \gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}.$$

See Appendix A.2 for the proof.

The function $d$ relates the four parameters $\epsilon, \beta, k, \delta$ by requiring
$\delta = d(k, \beta, \epsilon)$. Note that the other requirement is that $\epsilon \ge -\ln(1 - \beta)$.
Among the four parameters, $\epsilon$ and $\delta$ define the level of privacy
protection, while $k$ and $\beta$ affect the quality of anonymized data.
We now examine the relationships among these four parameters.

To compute this, we want to find $n \ge \lceil k/\gamma - 1 \rceil$ that maximizes
$\sum_{j > \gamma n}^{n} f(j; n, \beta)$. We first observe that $\gamma > \beta$ because

$$\gamma - \beta = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}} - \beta = \frac{(e^{\epsilon} - 1)(1 - \beta)}{e^{\epsilon}} > 0.$$

That is, $\sum_{j > \gamma n}^{n} f(j; n, \beta)$ sums up the tail binomial distribution
probabilities for the portion of the tail beyond $\gamma n$, as shown in
Figure 1. Following the intuition behind the law of large numbers, the
larger the value of $n$, the smaller this tail probability. Hence
intuitively, choosing the smallest value of $n$, i.e., $n = n_m = \lceil k/\gamma - 1 \rceil$,
should maximize the formula. Unfortunately, due to the discrete
nature of the binomial distribution, the maximum value may not be
reached at $n_m$, but instead at one of the next few local maximal
points $\lceil (k+1)/\gamma - 1 \rceil$, $\lceil (k+2)/\gamma - 1 \rceil$, $\ldots$. Thus we are unable to further
simplify the representation of the function $d(k, \beta, \epsilon)$.
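
Because $d(k, \beta, \epsilon)$ has no closed form, it has to be evaluated numerically. The following Python sketch shows one way this could be done. The function names and the `span` parameter are assumptions introduced here: the maximization over all $n \ge n_m$ is replaced by a scan over a finite window, relying on the observation above that the maximum occurs at $n_m$ or one of the next few local maxima.

```python
import math


def binom_pmf(j, n, beta):
    """f(j; n, beta): probability of exactly j successes in n trials."""
    return math.comb(n, j) * beta**j * (1.0 - beta) ** (n - j)


def gamma(beta, eps):
    """gamma = (e^eps - 1 + beta) / e^eps."""
    return (math.exp(eps) - 1.0 + beta) / math.exp(eps)


def tail(n, g, beta):
    """Sum of f(j; n, beta) over the integers j with g * n < j <= n."""
    first = math.floor(g * n) + 1
    return sum(binom_pmf(j, n, beta) for j in range(first, n + 1))


def d(k, beta, eps, span=20):
    """Approximate d(k, beta, eps) by scanning a finite window of n.

    Assumption: the maximum is attained for some n between n_m and
    ceil((k + span) / gamma); this window is a heuristic, not a proof.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    if eps < -math.log(1.0 - beta):
        raise ValueError("eps must be at least -ln(1 - beta)")
    g = gamma(beta, eps)
    n_m = math.ceil(k / g - 1)
    n_max = math.ceil((k + span) / g)
    return max(tail(n, g, beta) for n in range(n_m, n_max + 1))


# Example parameter settings in the range discussed in this section.
for eps in (1.0, 2.0):
    print(eps, d(k=20, beta=0.2, eps=eps))
```

Widening `span` makes the scan more conservative at the cost of extra computation.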

We now report the relationships among $\epsilon, \beta, k, \delta$ using numerical
computation. In Table 2 we fix $k = 20$ and report the values of $\delta$
under different $\epsilon$ and $\beta$ values. The table shows that the values of $\delta$
can be very small. We note that with fixed $k$ and $\beta$, $\delta$ decreases as $\epsilon$
increases, which states that the error probability gets smaller when
one relaxes the $\epsilon$-bound on the probability ratio. In other words,
the more serious a privacy breach, the less likely it is to occur. The
table also shows that with fixed $k$ and $\epsilon$, $\delta$ decreases as $\beta$ decreases,
meaning that a smaller sampling probability improves the privacy
protection.

In Figure 2 we show the results from examining the relationship
between $\epsilon$ and $\delta$ when we vary $k \in \{5, 10, 20, 30, 50\}$ under fixed
$\beta = 0.2$. We plot $1/\delta$ against $\epsilon$ for values of $\epsilon \ge -\ln(1 - \beta)$. The
figure indicates a negative correlation between $\epsilon$ and $\delta$.
Furthermore, increasing $k$ has a close to exponential effect of improving
privacy protection. For example, when $\epsilon = 2$, increasing $k$ by 10
reduces $\delta$ by several orders of magnitude.

In Figure 3 we show the results from examining the effect
of varying $\beta \in \{0.05, 0.1, 0.2, 0.3, 0.4\}$ under a fixed value of
$k = 20$. This shows that decreasing $\beta$ also dramatically improves
the privacy protection. The two figures indicate the intricate
relationship between privacy and utility.

In Figure 4 we explore this phenomenon that increasing $k$ and
decreasing $\beta$ both improve privacy protection. Starting from
($k = 15$, $\beta = 0.05$), each time we double $\beta$ and find a value of $k$ that gives
a similar level of privacy protection. We find that $k$ increases from
15 to 22 (for $\beta = 0.1$), 35 (for $\beta = 0.2$), and 60 (for $\beta = 0.4$).

In Figure 5 we examine the quality of privacy protection for very
small values of $k$ (from 1 to 5). We choose a very small sampling
probability of $\beta = 0.025$. Not surprisingly, when $k = 1$, the privacy
protection comes entirely from the sampling effect, as the obtained $\delta$
value is no less than $\beta$. However, when $k \ge 2$, we start seeing the privacy
protection effect of k-anonymization, with $\delta$ ($< 0.001$)
significantly smaller than $\beta = 0.025$ when $\epsilon = 2$.

Figure 1: A graph showing the relationship between $\beta n$ and $\gamma n$
on a binomial curve.

Figure 2: A graph showing the relationship between $\epsilon$ and $1/\delta$ if
we vary the values of $k$ under fixed $\beta$.

Finally, in Figure 6 we show the relationship between the privacy
parameter $\epsilon$ and the utility parameter $k$ if we set the requirement
that $\delta \le 10^{-6}$. The figure shows that smaller values of $\epsilon$ can be
satisfied for larger values of $k$. Furthermore, the effect of $\beta$ on $\epsilon$
is quite substantial.

3.4 $\epsilon$-Safe k-Anonymization

In practice, requiring a k-anonymization algorithm to be strongly
safe is likely to result in outputs that have low utility. We now relax
this requirement to allow the generalization scheme to be chosen in
a way that depends on the input dataset, but does not overly depend
on any individual tuple.

Definition 6 ($\epsilon$-Safe k-Anonymization). We say that a
k-anonymization algorithm $A$ is $\epsilon$-safe if and only if the function
$A_m$ satisfies $\epsilon$-DP.

One possible approach to do this is to consider various possible
generalization schemes, use a quality function to assign a quality
score to each of them, and then use the exponential mechanism [25] to
select, in a differentially private way, a generalization scheme that
gives good utility.
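
The selection step just described could look roughly like the following Python sketch. It uses the standard form of the exponential mechanism, in which a candidate with quality $q$ is chosen with probability proportional to $\exp(\epsilon q / (2\Delta))$, where $\Delta$ bounds how much one tuple can change any quality score. The scheme names and quality scores below are made up for illustration; designing a useful quality function with a known sensitivity is not addressed here.

```python
import math
import random


def exponential_mechanism(candidates, quality, eps, sensitivity, rng=random):
    """Pick one candidate with probability proportional to
    exp(eps * quality(c) / (2 * sensitivity)).

    Returns the chosen candidate together with the selection probabilities.
    """
    scores = [quality(c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(eps * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    return choice, probs


# Hypothetical generalization schemes with made-up quality scores.
schemes = ["5-year bins", "10-year bins", "20-year bins"]
made_up_quality = {"5-year bins": 3.0, "10-year bins": 5.0, "20-year bins": 4.0}

choice, probs = exponential_mechanism(
    schemes, made_up_quality.get, eps=1.0, sensitivity=1.0,
    rng=random.Random(0),
)
print(choice, probs)
```

Higher-quality schemes are more likely to be selected, but every scheme keeps a nonzero probability, which is what prevents the choice from depending too strongly on any single tuple.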
Table 2: A table showing the relationship between $\beta$ and $\epsilon$ in determining the value of $\delta$ when $k$ is fixed. In the above $k = 20$, and
each cell in the table reports the value of $\delta$ under the given values of $\beta$ and $\epsilon$.

Figure 3: A graph showing the relationship between $\epsilon$ and $1/\delta$ if
we vary the values of $\beta$ under fixed $k$.

Figure 4: A graph showing the relationship between the values
of $k$ needed to achieve roughly the same $\delta$ if we double $\beta$.

Figure 5: A graph showing the relationship between $\epsilon$ and $1/\delta$
with small values of $k$, varying $k$ and fixing $\beta$.

Figure 6: A graph showing the value of $\epsilon$ satisfied by a given $k$
if $\delta \le 10^{-6}$ with varying sampling probabilities.

The following theorem shows that $\epsilon$-safe k-anonymization also
satisfies $(\beta, \epsilon, \delta)$-DPS.

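Theorem 6. Any $\epsilon_1$-safe k-anonymization algorithm satisfies
$(\beta, \epsilon, \delta)$-DPS, where $\epsilon \ge -\ln(1 - \beta) + \epsilon_1$ and

$$\delta = d(k, \beta, \epsilon - \epsilon_1) = \max_{n : n \ge \lceil k/\gamma - 1 \rceil} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \gamma = \frac{e^{\epsilon - \epsilon_1} - 1 + \beta}{e^{\epsilon - \epsilon_1}}.$$

See Appendix A.3 for the proof.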

3.5 Remarks on the Results

Theorems 5 and 6 show that k-anonymization, when done
safely, and when preceded by a random sampling step, can
satisfy $(\epsilon, \delta)$-DP with reasonable parameters. In the literature,
k-anonymization and differential privacy have been viewed as very
different privacy guarantees: k-anonymization achieves weak
syntactic privacy, and differential privacy provides strong semantic
privacy guarantees. Our result is, to our knowledge, the first to link
k-anonymization with differential privacy. This suggests that the
"hiding in a crowd of k" privacy principle indeed offers some
privacy guarantees when used correctly. We note that this principle is
used widely in contexts other than privacy-preserving publishing of
relational data, including location privacy and publishing of social
network data, network packets, and other types of data.

We also observe that another way to interpret our result is that
it provides a new method of satisfying $(\epsilon, \delta)$-DP. Existing
methods for satisfying differential privacy include adding noise
according to the global sensitivity [8], adding noise according to the
smooth local sensitivity [26], and the exponential mechanism [25],
which directly assigns probabilities to each possible answer in the
range. Our result suggests an alternative approach: Rather than
adding noise to the output, one can add a random sampling step in
the beginning and prune results that are too sensitive to changes of
a single individual tuple (i.e., tuples that violate k-anonymity). In
other words, when the dataset results from random sampling,
one can answer count queries accurately provided that the
result is large enough. An intriguing question is whether other input
perturbation techniques can be used to satisfy differential privacy
as well.
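
The sample-then-prune approach described above can be sketched in Python as follows. The function names, the decade recoding, the synthetic age data and the parameter values are illustrative assumptions; the sketch releases counts computed on the sample and makes no attempt to verify the resulting $(\epsilon, \delta)$ values.

```python
import random
from collections import Counter


def sample(D, beta, rng):
    """Keep each tuple independently with probability beta."""
    return [t for t in D if rng.random() < beta]


def decade(t):
    """A fixed, data-independent recoding (illustrative): age -> decade."""
    (age,) = t
    return (age // 10 * 10,)


def safe_counts(D, g, k, beta, rng):
    """Sample D, recode with the fixed mapping g, and release only the
    counts that reach the threshold k; smaller counts are pruned."""
    counts = Counter(g(t) for t in sample(D, beta, rng))
    return {cell: c for cell, c in counts.items() if c >= k}


rng = random.Random(7)                      # fixed seed for repeatability
D = [(20 + i % 40,) for i in range(2000)]   # hypothetical ages 20..59
released = safe_counts(D, decade, k=20, beta=0.2, rng=rng)
print(released)
```

No noise is added to the released counts; the uncertainty comes from the sampling step, and the threshold `k` removes the cells that a single tuple could visibly affect.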

4. RELATED WORK 

A lot of work on privacy-preserving data publishing considers
privacy notions that are weaker than differential privacy. These
approaches typically assume an adversary that knows only some
aspects of the dataset (background knowledge) and try to prevent it
from learning some other aspects. One can always attack such a
privacy notion by changing either what the adversary already knows
or what the adversary tries to learn. The most prominent
among these notions is k-anonymity [30, 29]. Some follow-up
notions include $\ell$-diversity [23] and t-closeness [22]. In this paper, we
analyze the weaknesses of k-anonymity in detail, and argue that a
separation between QIDs and sensitive attributes is difficult to
obtain in practice, challenging the foundation of privacy notions such
as $\ell$-diversity, t-closeness, and other ones centered on attribute
disclosure prevention.

The notion of differential privacy was developed in a series of
works [7, 13, 3, 11, 8]. It represents a major breakthrough in
privacy-preserving data analysis. In an attempt to make differential
privacy more amenable to more sensitive queries, several
relaxations have been developed, including $(\epsilon, \delta)$-differential privacy
[7, 13, 3, 10]. Three basic general approaches to achieve differential
privacy are adding Laplace noise proportional to the query's global
sensitivity [8, 11], adding noise related to the smooth bound of the
query's local sensitivity [26], and the exponential mechanism to
select a result among all possible results [25]. A survey on these
results can be found in [9]. Our approach suggests an alternative
by using input perturbation rather than output perturbation to add
uncertainty to the adversary's knowledge of the data.

Random sampling [1, 2] has been studied as a method for
privacy-preserving data mining, where privacy notions other than
differential privacy were used. The relationship between sampling
and differential privacy has been explored before. Chaudhuri and
Mishra [6] studied the privacy effect of sampling, and showed a
linear relationship between the sampling probability and the error
probability $\delta$. Their result suggests an approach that first performs
k-anonymization and then sampling as the last step. We instead
consider the approach of performing sampling as the first step and
then k-anonymization. Our result suggests that the latter approach
benefits much more from the sampling.

There exists some work on publishing microdata while satisfying
$(\epsilon, \delta)$-DP or its variant. Machanavajjhala et al. [24] introduced a
variant of $(\epsilon, \delta)$-DP called $(\epsilon, \delta)$-probabilistic differential privacy
and showed that it is satisfied by a synthetic data generation method
for the problem of releasing the commuting patterns of the
population in the United States. This notion is stronger than $(\epsilon, \delta)$-DP.
Korolova et al. [20] considered publishing search queries and clicks
in a way that achieves $(\epsilon, \delta)$-differential privacy. A similar approach for
releasing query logs with differential privacy was proposed by Gotz
et al. [15]. These approaches apply the output perturbation
technique in differential privacy to microdata publishing scenarios that
can be reduced to histogram publishing at their core. Blum et al. [4]
and Dwork et al. [12] considered outputting synthetic data
that is useful for a particular class of queries. These papers do
not deal with the relationship between k-anonymization and
differential privacy, or between sampling and k-anonymization.

Kifer and Lin [19] developed a general framework to
characterize relaxations of differential privacy. They identified two axioms
for a privacy definition: Transformation Invariance and Privacy
Axiom of Choice, which are satisfied by $(\beta, \epsilon, \delta)$-DPS. They did not
consider the composability of these notions, which was our
emphasis, as a clear understanding of the composability issues tells us
what can and cannot be done with a sampled dataset.



5. CONCLUSIONS 

We have answered the two questions we set out in the
beginning of the paper. We take the approach of starting from both
k-anonymization and differential privacy and trying to meet in
the middle. On the one hand, we identify weaknesses in the
k-anonymity notion and existing k-anonymization methods and
propose the notion of safe k-anonymization to avoid these privacy
vulnerabilities. On the other hand, we try to relax differential privacy
to take advantage of the adversary's uncertainty about the data. The
key insight underlying our results is that random sampling can be
used to bridge this gap between k-anonymization and differential
privacy.

We have explored both the power and the potential pitfalls of taking
advantage of sampling in private data analysis or publishing. Our
results show that sampling, when used correctly, is a powerful tool
that can greatly benefit differential privacy, as it creates uncertainty
for the adversary. Sampling can increase the privacy budget and
the error tolerance bound. Sampling also enables the usage of
algorithms such as safe k-anonymization; however, this usage requires
fresh sampling that is not used to answer any other query. An
intriguing open question is whether there exist approaches other than
sampling that can create uncertainty for the adversary and that can
tolerate answering $\epsilon$-DP queries.






6. REFERENCES 

[1] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving 
olap. In Proceedings of the ACM SIGMOD International 
Conference on Management of Data (SIGMOD), pages 
251-262, 2005. 

[2] S. Agrawal and J. R. Haritsa. A framework for high-accuracy 
privacy-preserving mining. In Proceedings of the 
International Conference on Data Engineering (ICDE), 
pages 193-204, 2005. 

[3] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical 
privacy: the sulq framework. In PODS '05: Proceedings of 
the twenty-fourth ACM SIGMOD-SIGACT-SIGART 
symposium on Principles of database systems, pages 
128-138, New York, NY, USA, 2005. ACM. 

[4] A. Blum, K. Ligett, and A. Roth. A learning theory approach 
to non-interactive database privacy. In STOC, pages 
609-618, 2008. 

[5] J.-W. Byun, A. Kamra, E. Bertino, and N. Li. Efficient
k-anonymization using clustering techniques. In Proceedings
of the 12th international conference on Database systems for
advanced applications, DASFAA'07, pages 188-200, 2007.

[6] K. Chaudhuri and N. Mishra. When random sampling
preserves privacy. In CRYPTO, pages 198-213, 2006.

[7] I. Dinur and K. Nissim. Revealing information while 
preserving privacy. In PODS '03: Proceedings of the 
twenty-second ACM SIGMOD-SIGACT-SIGART symposium 
on Principles of database systems, pages 202-210, New 
York, NY, USA, 2003. ACM. 

[8] C. Dwork. Differential privacy. In ICALP, pages 1-12, 2006. 

[9] C. Dwork. Differential privacy: A survey of results. In 
TAMC, pages 1-19, 2008. 
[10] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and 
M. Naor. Our data, ourselves: Privacy via distributed noise 
generation. In S. Vaudenay, editor, EUROCRYPT, volume 
4004 of Lecture Notes in Computer Science, pages 486-503. 
Springer, 2006. 
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith.
Calibrating noise to sensitivity in private data analysis. In
TCC, pages 265-284, 2006.
[12] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and 
S. Vadhan. On the complexity of differentially private data 
release: efficient algorithms and hardness results. In STOC, 
pages 381-390, 2009. 
[13] C. Dwork and K. Nissim. Privacy-preserving datamining on
vertically partitioned databases. In CRYPTO, pages
528-544. Springer, 2004.
[14] B. Gedik and L. Liu. Protecting location privacy with 
personalized k-anonymity: Architecture and algorithms. 
IEEE Transactions on Mobile Computing, 7:1-18, January 
2008. 

[15] M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, and 
J. Gehrke. Privacy in search logs. CoRR, abs/0904.0682, 
2009. 

[16] Y. He and J. F. Naughton. Anonymization of set-valued data
via top-down, local generalization. In Proceedings of the
International Conference on Very Large Data Bases (VLDB), 
page ss, 2009. 

[17] S. P. Kasiviswanathan, H. K. Lee, K. Nissim,
S. Raskhodnikova, and A. Smith. What can we learn
privately? In IEEE Symposium on Foundations of Computer
Science (FOCS), pages 320-326, 1992.

[18] S. P. Kasiviswanathan and A. Smith. A note on differential
privacy: Defining resistance to arbitrary side information.
CoRR, abs/0803.3946, 2008.
[19] D. Kifer and B.-R. Lin. Towards an axiomatization of
statistical privacy and utility. In Proceedings of the
twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium
on Principles of database systems of data, PODS '10, pages
147-158, New York, NY, USA, 2010. ACM.
[20] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas.
Releasing search queries and clicks privately. In Proceedings
of the International World Wide Web Conference (WWW),
pages 171-180, 2009.
[21] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian
multidimensional k-anonymity. In Proceedings of the
International Conference on Data Engineering (ICDE),
page 25, 2006.
[22] N. Li, T. Li, and S. Venkatasubramanian. t-closeness:
Privacy beyond k-anonymity and l-diversity. In ICDE, pages
106-115, 2007.
[23] A. Machanavajjhala, J. Gehrke, D. Kifer, and
M. Venkitasubramaniam. l-diversity: Privacy beyond
k-anonymity. In Proceedings of the International Conference
on Data Engineering (ICDE), page 24, 2006.
[24] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and
L. Vilhuber. Privacy: Theory meets practice on the map. In
ICDE, pages 277-286, 2008.
[25] F. McSherry and K. Talwar. Mechanism design via
differential privacy. In FOCS, pages 94-103, 2007.
[26] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth
sensitivity and sampling in private data analysis. In STOC,
pages 75-84, 2007.
[27] P. Samarati. Protecting respondents' identities in microdata
release. IEEE Trans. on Knowl. and Data Eng.,
13:1010-1027, November 2001.
[28] P. Samarati and L. Sweeney. Protecting privacy when
disclosing information: k-anonymity and its enforcement
through generalization and suppression. Technical report,
SRI International, 1998.
[29] L. Sweeney. Achieving k-anonymity privacy protection
using generalization and suppression. Int. J. Uncertain.
Fuzziness Knowl.-Based Syst., 10(5):571-588, 2002.
[30] L. Sweeney. k-anonymity: A model for protecting privacy.
Int. J. Uncertain. Fuzziness Knowl.-Based Syst.,
10(5):557-570, 2002.

APPENDIX 
A. PROOFS 

This appendix includes proofs not included in the main body. 

A.1 Proof of Theorem 1

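Theorem 1. Given an algorithm $A$ that satisfies
$(\beta_1, \epsilon_1, \delta_1)$-DPS, $A$ also satisfies $(\beta_2, \epsilon_2, \delta_2)$-DPS for any
$\beta_2 < \beta_1$, where

$$\epsilon_2 = \ln\left(1 + \frac{\beta_2}{\beta_1}\left(e^{\epsilon_1} - 1\right)\right) \quad \text{and} \quad \delta_2 = \frac{\beta_2}{\beta_1}\,\delta_1.$$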

Proof. We need to show that the algorithm $A^{\beta_2}$ satisfies
$(\epsilon_2, \delta_2)$-DP. Let $\beta = \beta_2/\beta_1$. The algorithm $A^{\beta_2}$ can be viewed
as first sampling with probability $\beta$, followed by applying the
algorithm $A^{\beta_1}$, which satisfies $(\epsilon_1, \delta_1)$-DP.

We use $\Lambda_\beta(D)$ to denote the process of sampling from $D$ with
sampling rate $\beta$. Any pair $D, D'$ can be viewed as $D$ and $D_{-t}$,
where $D_{-t}$ denotes the dataset resulting from removing one copy
of $t$ from $D$. For any $O$, let

$$Z = \Pr[A^{\beta_2}(D) \in O], \quad \text{and} \quad X = \Pr[A^{\beta_2}(D_{-t}) \in O];$$

we want to show that

$$\left(Z \le e^{\epsilon_2} X + \delta_2\right) \wedge \left(X \le e^{\epsilon_2} Z + \delta_2\right).$$

We have

$$X = \sum_{T \subseteq D_{-t}} \Pr[\Lambda_\beta(D_{-t}) = T] \, \Pr[A^{\beta_1}(T) \in O].$$

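To analyze $Z$, we note that all the $T$'s that result from
sampling from $D$ with probability $\beta$ can be divided into those in which
$t$ is not sampled, and those in which $t$ is sampled. For a $T$ in the
former case, we have

$$\Pr[\Lambda_\beta(D) = T] = (1 - \beta) \Pr[\Lambda_\beta(D) = T \mid t \text{ not sampled in } T] = (1 - \beta) \Pr[\Lambda_\beta(D_{-t}) = T].$$

For a $T$ in the latter case, we have

$$\Pr[\Lambda_\beta(D) = T] = \beta \Pr[\Lambda_\beta(D) = T \mid t \text{ sampled in } T] = \beta \Pr[\Lambda_\beta(D_{-t}) = T_{-t}].$$

Hence we have

$$Z = \sum_{T \subseteq D_{-t}} (1 - \beta) \Pr[\Lambda_\beta(D_{-t}) = T] \Pr[A^{\beta_1}(T) \in O] + \sum_{T_{-t} \subseteq D_{-t}} \beta \Pr[\Lambda_\beta(D_{-t}) = T_{-t}] \Pr[A^{\beta_1}(T) \in O],$$

where in the second sum $T$ is $T_{-t}$ with $t$ added back. Let

$$Y = \sum_{T' \subseteq D_{-t}} \Pr[\Lambda_\beta(D_{-t}) = T'] \Pr[A^{\beta_1}(T'_{+t}) \in O],$$

where $T'_{+t}$ denotes $T'$ with one copy of $t$ added; then we have $Z = (1 - \beta) X + \beta Y$.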
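That $A$ satisfies $(\beta_1, \epsilon_1, \delta_1)$-DPS means that for each $T$, $O$,

$$\Pr[A^{\beta_1}(T_{+t}) \in O] \le e^{\epsilon_1} \Pr[A^{\beta_1}(T) \in O] + \delta_1.$$

Hence we have

$$Y \le \sum_{T' \subseteq D_{-t}} \Pr[\Lambda_\beta(D_{-t}) = T'] \left(e^{\epsilon_1} \Pr[A^{\beta_1}(T') \in O] + \delta_1\right)$$
$$= e^{\epsilon_1} \sum_{T' \subseteq D_{-t}} \Pr[\Lambda_\beta(D_{-t}) = T'] \Pr[A^{\beta_1}(T') \in O] + \delta_1 \sum_{T' \subseteq D_{-t}} \Pr[\Lambda_\beta(D_{-t}) = T']$$
$$= e^{\epsilon_1} X + \delta_1.$$

Hence we have

$$Z = (1 - \beta) X + \beta Y \le (1 - \beta) X + \beta \left(e^{\epsilon_1} X + \delta_1\right) = \left(1 - \beta + \beta e^{\epsilon_1}\right) X + \beta \delta_1 = e^{\epsilon_2} X + \delta_2.$$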

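To show that $X \le e^{\epsilon_2} Z + \delta_2$, we observe that $A$ satisfying
$(\beta_1, \epsilon_1, \delta_1)$-DPS also means that

$$X \le e^{\epsilon_1} Y + \delta_1,$$

and hence

$$Z = (1 - \beta) X + \beta Y \ge (1 - \beta) X + \beta e^{-\epsilon_1} (X - \delta_1),$$

and

$$X \le \frac{Z}{1 - \beta + \beta e^{-\epsilon_1}} + \frac{\beta e^{-\epsilon_1} \delta_1}{1 - \beta + \beta e^{-\epsilon_1}}.$$

We now show that

$$\frac{1}{1 - \beta + \beta e^{-\epsilon_1}} \le e^{\epsilon_2} = 1 - \beta + \beta e^{\epsilon_1}.$$

Multiplying out, this is equivalent to

$$0 \le \left(e^{\epsilon_1} + e^{-\epsilon_1} - 2\right)\left(\beta - \beta^2\right),$$

which holds because $e^{\epsilon_1} + e^{-\epsilon_1} \ge 2$ and $0 < \beta < 1$. Hence, since $\epsilon_2 \le \epsilon_1$,

$$\frac{\beta e^{-\epsilon_1} \delta_1}{1 - \beta + \beta e^{-\epsilon_1}} \le \beta e^{-\epsilon_1} e^{\epsilon_2} \delta_1 \le \beta \delta_1 = \delta_2,$$

and therefore $X \le e^{\epsilon_2} Z + \delta_2$. □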
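
As a quick numerical illustration of Theorem 1, the following Python sketch computes the amplified parameters and spot-checks the inequality used in the second half of the proof. The function name `amplify` and the example values are assumptions chosen for illustration.

```python
import math


def amplify(beta1, eps1, delta1, beta2):
    """Return (eps2, delta2) from Theorem 1 for sampling rates beta2 < beta1."""
    if not 0.0 < beta2 < beta1 <= 1.0:
        raise ValueError("need 0 < beta2 < beta1 <= 1")
    ratio = beta2 / beta1
    eps2 = math.log(1.0 + ratio * (math.exp(eps1) - 1.0))
    delta2 = ratio * delta1
    return eps2, delta2


def proof_inequality_holds(beta, eps1):
    """Check 1 / (1 - beta + beta e^{-eps1}) <= 1 - beta + beta e^{eps1}."""
    lhs = 1.0 / (1.0 - beta + beta * math.exp(-eps1))
    rhs = 1.0 - beta + beta * math.exp(eps1)
    return lhs <= rhs + 1e-12


# Example: an algorithm that is (1.0, 1e-5)-DP without sampling (beta1 = 1),
# run on a 10% sample instead.
eps2, delta2 = amplify(beta1=1.0, eps1=1.0, delta1=1e-5, beta2=0.1)
print(eps2, delta2)
```

In this example both parameters shrink, which is the sense in which sampling amplifies the privacy guarantee.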



A.2 Proof of Theorem 5

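Theorem 5. Any strongly-safe k-anonymization algorithm
satisfies $(\beta, \epsilon, \delta)$-DPS for any $0 < \beta < 1$, $\epsilon \ge -\ln(1 - \beta)$, and
$\delta = d(k, \beta, \epsilon)$, where the function $d$ is defined as

$$d(k, \beta, \epsilon) = \max_{n : n \ge \lceil k/\gamma - 1 \rceil} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \text{where } \gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}.$$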

Proof. Let $A$ denote the algorithm, and $g$ be the
data-independent generalization procedure in the algorithm. For any
dataset $D$, any tuple $t \in D$, any output $S$, and any
$\epsilon \ge -\ln(1 - \beta)$, we show that the probability with which

$$e^{-\epsilon} \le \frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} \le e^{\epsilon} \qquad (2)$$

is violated is bounded by $\delta$. Note that this is a stronger version of $(\epsilon, \delta)$-DP
than the one in Definition 2. See [18] for the relationship between the
two.

Let $n$ be the number of $t'$ in $D$ such that $g(t') = g(t)$. Let $j$ be
the number of times that $g(t)$ appears in $S$. Note that as the only
difference between $D$ and $D_{-t}$ is that $D$ has one extra copy of $t$,
we have

$$\frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} = \frac{\Pr[A(D) \text{ has } j \text{ copies of } g(t)]}{\Pr[A(D_{-t}) \text{ has } j \text{ copies of } g(t)]}.$$

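Because any tuple that appears less than $k$ times is suppressed,
either $j \ge k$, or $j = 0$. When $j = 0$, we have

$$\frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} = \frac{F(k - 1; n, \beta)}{F(k - 1; n - 1, \beta)} = \frac{\sum_{i=0}^{k-1} f(i; n, \beta)}{\sum_{i=0}^{k-1} f(i; n - 1, \beta)}.$$

Because $F(k - 1; n, \beta)$ is always less than $F(k - 1; n - 1, \beta)$ (see footnote 9),
we have $\frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} \le e^{\epsilon}$. Furthermore, we note that for all $i \in [0..k-1]$,
$\frac{f(i; n, \beta)}{f(i; n - 1, \beta)} \ge (1 - \beta)$. Hence

$$\frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} \ge (1 - \beta).$$

Because $\epsilon \ge -\ln(1 - \beta)$, we have $e^{-\epsilon} \le 1 - \beta$; hence
in the case when $j = 0$, inequality (2) is satisfied.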
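When $j \ge k$, we have

$$\frac{\Pr[A(D) = S]}{\Pr[A(D_{-t}) = S]} = \frac{f(j; n, \beta)}{f(j; n - 1, \beta)} = \begin{cases} \dfrac{n(1 - \beta)}{n - j} & n > j \\ 1 & n < j. \end{cases}$$

When $n = j$, the denominator $f(j; n - 1, \beta)$ is zero while the numerator is
positive, so this outcome is counted among the bad ones below.

The choice of $n$ can be arbitrary because it is determined by the
choice of $D$. The value of $j$ is determined by the choice of $S$. For
some values of $j$, inequality (2) is violated. We want to compute
the probabilities of these bad $j$'s occurring. From the above, we
know when $j > n$, the outcome is good. We now consider the bad
outcomes when $j \le n$.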



Footnote 9: Let the $X_i$'s be random variables that take the value 1 with
probability $\beta$, and 0 with probability $1 - \beta$. $F(k - 1; n - 1, \beta)$ is the
probability that the sum of $n - 1$ such $X_i$'s is $\le k - 1$, and $F(k - 1; n, \beta)$
is the probability that the sum of $n$ such $X_i$'s is $\le k - 1$.

Note that because $\epsilon \ge -\ln(1 - \beta)$, we have $-\epsilon \le \ln(1 - \beta)$,
and

$$\frac{n(1 - \beta)}{n - j} \ge 1 - \beta \ge e^{-\epsilon}.$$

Hence we only need to consider what $j$'s make $\frac{n(1 - \beta)}{n - j} > e^{\epsilon}$. This
occurs when $j > \frac{(e^{\epsilon} - 1 + \beta)\, n}{e^{\epsilon}}$. Let $\gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}$; then this occurs
when $j > \gamma n$.

So far our analysis has shown that a bad outcome $S$ for an input
$D$ would satisfy the condition $j \ge k$ and $n \ge j > \gamma n$. Now we
need to compute the probability that $A(D)$ gives a bad outcome,
and the probability that $A(D_{-t})$ gives a bad outcome. The former
is given below:

$$\sum_{j : (j \ge k \,\wedge\, j > \gamma n)} f(j; n, \beta). \qquad (3)$$

And the latter is

$$\sum_{j : (j \ge k \,\wedge\, j > \gamma n)} f(j; n - 1, \beta).$$

As the latter is smaller than the former, we only need to bound the
former.

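Let $n_m = \lceil k/\gamma - 1 \rceil$; we now show that when $n \le n_m$,
$\sum_{j : (j \ge k \,\wedge\, j > \gamma n)} f(j; n, \beta)$ increases when $n$ increases. Note
that the choice of $n_m$ satisfies the condition that $\gamma n_m < k$
and $\gamma (n_m + 1) \ge k$. Observe that when $n \le n_m$, the
condition $(j \ge k \wedge j > \gamma n)$ becomes $j \ge k$. The function
$\sum_{j \ge k}^{n} f(j; n, \beta)$ is monotonically increasing with respect to $n$.

When $n > n_m$, the condition $(j \ge k \wedge j > \gamma n)$ becomes
$j > \gamma n$. (In fact, when $n = n_m + 1$, the smallest $j$ to satisfy $j > \gamma n$ is
$k + 1$.) Hence the error probability is bounded by

$$\delta = d(k, \beta, \epsilon) = \max_{n : n \ge n_m} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \text{where } \gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}.$$

□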

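A.3 Proof of Theorem 6

Theorem 6. Any $\epsilon_1$-safe k-anonymization algorithm satisfies
$(\beta, \epsilon, \delta)$-DPS, where $\epsilon \ge -\ln(1 - \beta) + \epsilon_1$ and

$$\delta = d(k, \beta, \epsilon - \epsilon_1) = \max_{n : n \ge \lceil k/\gamma - 1 \rceil} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \gamma = \frac{e^{\epsilon - \epsilon_1} - 1 + \beta}{e^{\epsilon - \epsilon_1}}.$$

Proof. Let $A$ denote the $\epsilon_1$-safe k-anonymization algorithm.
Here, we want to show that for any $\epsilon \ge -\ln(1 - \beta) + \epsilon_1$, $D$,
$t \in D$ and $S$,

$$e^{-\epsilon} \le \frac{\Pr[A^{\beta}(D) = S]}{\Pr[A^{\beta}(D_{-t}) = S]} \le e^{\epsilon} \qquad (4)$$

is valid with probability at least $1 - \delta$. Let $\Lambda_\beta$ denote the process
of binomial sampling the dataset $D$ with probability $\beta$. And let $G$
denote the set of all the possible outputs of $A$'s subroutine $A_m$. By
definition, its subroutine $A_m$ satisfies $\epsilon_1$-differential privacy:

$$e^{-\epsilon_1} \le \frac{\Pr[A_m(D) = g]}{\Pr[A_m(D_{-t}) = g]} \le e^{\epsilon_1}.$$

And, according to the proof of Theorem 5, for a fixed $g \in G$, the
ratio $r(g) = \frac{\Pr[g(\Lambda_\beta(D)) = S]}{\Pr[g(\Lambda_\beta(D_{-t})) = S]}$ is

$$r(g) = \begin{cases} \dfrac{\sum_{i=0}^{k-1} f(i; n, \beta)}{\sum_{i=0}^{k-1} f(i; n - 1, \beta)} & \text{if } j = 0; \\[2ex] \dfrac{f(j; n, \beta)}{f(j; n - 1, \beta)} & \text{if } k \le j \le n, \end{cases}$$

where $j$ is the number of copies of $g(t)$ in the output dataset $S$. So,
the differential privacy ratio $\omega$ can be upper bounded:

$$\omega = \frac{\Pr[A^{\beta}(D) = S]}{\Pr[A^{\beta}(D_{-t}) = S]} = \frac{\sum_{g \in G} \Pr[A_m(D) = g] \cdot \Pr[g(\Lambda_\beta(D)) = S]}{\sum_{g \in G} \Pr[A_m(D_{-t}) = g] \cdot \Pr[g(\Lambda_\beta(D_{-t})) = S]}$$
$$\le \frac{e^{\epsilon_1} \sum_{g \in G} \Pr[A_m(D_{-t}) = g] \cdot \Pr[g(\Lambda_\beta(D)) = S]}{\sum_{g \in G} \Pr[A_m(D_{-t}) = g] \cdot \Pr[g(\Lambda_\beta(D_{-t})) = S]}$$
$$\le \frac{e^{\epsilon_1} r(g) \sum_{g \in G} \Pr[A_m(D_{-t}) = g] \cdot \Pr[g(\Lambda_\beta(D_{-t})) = S]}{\sum_{g \in G} \Pr[A_m(D_{-t}) = g] \cdot \Pr[g(\Lambda_\beta(D_{-t})) = S]}$$
$$= e^{\epsilon_1} r(g).$$

The lower bound can be obtained in a similar way. So,

$$e^{-\epsilon_1} r(g) \le \frac{\Pr[A^{\beta}(D) = S]}{\Pr[A^{\beta}(D_{-t}) = S]} \le e^{\epsilon_1} r(g).$$

By the proof of Theorem 5, the ratio $r(g)$ is bounded by
$e^{-(\epsilon - \epsilon_1)} \le r(g) \le e^{\epsilon - \epsilon_1}$. The probability that this bound is violated
is the probability that inequality (4) is violated. In the $j = 0$ case,

$$e^{-(\epsilon - \epsilon_1)} \le \frac{F(k - 1; n, \beta)}{F(k - 1; n - 1, \beta)} \le e^{\epsilon - \epsilon_1},$$

since $\epsilon \ge -\ln(1 - \beta) + \epsilon_1$. And for the $k \le j \le n$ case,
$\frac{n(1 - \beta)}{n - j} \ge (1 - \beta) \ge e^{-\epsilon + \epsilon_1}$.
And only when $\frac{n(1 - \beta)}{n - j} > e^{\epsilon - \epsilon_1}$ is inequality (4) violated.
Let $\gamma = \frac{e^{\epsilon - \epsilon_1} - 1 + \beta}{e^{\epsilon - \epsilon_1}}$; the violation occurs exactly when $j > \gamma n$.
The error probability $\delta$ is

$$\delta = d(k, \beta, \epsilon - \epsilon_1) = \max_{n : n \ge \lceil k/\gamma - 1 \rceil} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \quad \text{where } \gamma = \frac{e^{\epsilon - \epsilon_1} - 1 + \beta}{e^{\epsilon - \epsilon_1}}.$$

□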